suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(lubridate))
suppressMessages(library(plotly))
suppressMessages(library(janitor))
suppressMessages(library(leaflet))
suppressMessages(library(colorRamps))
suppressMessages(library(proj4))
suppressMessages(library(validate))
suppressMessages(library(ggthemes))
suppressMessages(library(scales))
suppressMessages(library(broom))
In this project, we will walk through the entire data science pipeline.
Many of the steps we explain would be confusing without a concrete file to follow along with, so this project uses a data set titled “Los Angeles Parking Tickets” to illustrate the concepts described and help the reader understand how to implement each step of the pipeline. Following these steps shows the process an analyst goes through to build and support a conclusion.
The first step in the pipeline is data curation: the organization and integration of data collected from a source. The data we will use comes from this website: https://www.kaggle.com/cityofLA/los-angeles-parking-citations
The file, parking-citations.csv, is a .csv file, a very common format for storing large amounts of data. Once it is stored in the same folder as the document you are creating, it can be read in. We will store the data in a variable called parking_citations. After importing the csv, we call parking_citations to display what is actually stored in the file.
parking_citations <- read_csv("parking-citations.csv")
## Parsed with column specification:
## cols(
## `Ticket number` = col_double(),
## `Issue Date` = col_datetime(format = ""),
## `Issue time` = col_double(),
## `Meter Id` = col_character(),
## `Marked Time` = col_double(),
## `RP State Plate` = col_character(),
## `Plate Expiry Date` = col_double(),
## VIN = col_logical(),
## Make = col_character(),
## `Body Style` = col_character(),
## Color = col_character(),
## Location = col_character(),
## Route = col_character(),
## Agency = col_double(),
## `Violation code` = col_character(),
## `Violation Description` = col_character(),
## `Fine amount` = col_double(),
## Latitude = col_double(),
## Longitude = col_double()
## )
## Warning: 16479 parsing failures.
## row col expected actual file
## 614361 Ticket number no trailing characters D 'parking-citations.csv'
## 614362 Ticket number no trailing characters D 'parking-citations.csv'
## 614363 Ticket number no trailing characters D 'parking-citations.csv'
## 697463 Ticket number no trailing characters D 'parking-citations.csv'
## 697464 Ticket number no trailing characters D 'parking-citations.csv'
## ...... ............. ...................... ...... .......................
## See problems(...) for more details.
parking_citations
## # A tibble: 9,409,991 x 19
## `Ticket number` `Issue Date` `Issue time` `Meter Id`
## <dbl> <dttm> <dbl> <chr>
## 1 1103341116 2015-12-21 00:00:00 1251 <NA>
## 2 1103700150 2015-12-21 00:00:00 1435 <NA>
## 3 1104803000 2015-12-21 00:00:00 2055 <NA>
## 4 1104820732 2015-12-26 00:00:00 1515 <NA>
## 5 1105461453 2015-09-15 00:00:00 115 <NA>
## 6 1106226590 2015-09-15 00:00:00 19 <NA>
## 7 1106500452 2015-12-17 00:00:00 1710 <NA>
## 8 1106500463 2015-12-17 00:00:00 1710 <NA>
## 9 1106506402 2015-12-22 00:00:00 945 <NA>
## 10 1106506413 2015-12-22 00:00:00 1100 <NA>
## # … with 9,409,981 more rows, and 15 more variables: `Marked Time` <dbl>,
## # `RP State Plate` <chr>, `Plate Expiry Date` <dbl>, VIN <lgl>,
## # Make <chr>, `Body Style` <chr>, Color <chr>, Location <chr>,
## # Route <chr>, Agency <dbl>, `Violation code` <chr>, `Violation
## # Description` <chr>, `Fine amount` <dbl>, Latitude <dbl>,
## # Longitude <dbl>
Parsing is the process of reading raw text into structured, typed data; once the data is parsed, we choose which pieces of information we need for our analysis. Many csv files are very detailed, and not every piece of information is needed. For example, this data set records the make and color of the car that received each ticket, but if we were just writing an article about the average price of a parking ticket, that information would be unnecessary.
As you can see, this file is very large, and we are unable to see every piece of information it holds at once. We can use the select function to view specific columns of the file. Below, we look at just the “Fine amount” column; select can also pull multiple columns at once.
select(parking_citations, `Fine amount`)
## # A tibble: 9,409,991 x 1
## `Fine amount`
## <dbl>
## 1 50
## 2 50
## 3 58
## 4 NA
## 5 93
## 6 50
## 7 163
## 8 163
## 9 93
## 10 93
## # … with 9,409,981 more rows
select(parking_citations, `Fine amount`, Latitude)
## # A tibble: 9,409,991 x 2
## `Fine amount` Latitude
## <dbl> <dbl>
## 1 50 99999
## 2 50 99999
## 3 58 6439998.
## 4 NA 6440041.
## 5 93 99999
## 6 50 99999
## 7 163 99999
## 8 163 99999
## 9 93 99999
## 10 93 99999
## # … with 9,409,981 more rows
Managing the data is also an important step in the data science pipeline. This includes creating new variables, deleting data, and handling missing data. We cannot expect everything to be recorded, so we have to find ways to deal with missing values; there are two main methods we will use. In our example, we will be using the latitude and longitude columns for our report, and if you take a look at them, you will see that missing coordinates are entered as just 99999.
select(parking_citations, Longitude)
## # A tibble: 9,409,991 x 1
## Longitude
## <dbl>
## 1 99999
## 2 99999
## 3 1802686.
## 4 1802686.
## 5 99999
## 6 99999
## 7 99999
## 8 99999
## 9 99999
## 10 99999
## # … with 9,409,981 more rows
The first method of handling missing data is known as imputation. With imputation, we fill in the missing values with the average of the values we are given. This is a very simple approach and can make the data less accurate: if there are many missing values, the column will be pulled toward the average when that may not reflect reality.
Since we have such a large data set, we can afford to skip imputation and simply drop the rows with missing values, which keeps the data as accurate as possible. For our example, we do this by adding the line filter(Latitude != 99999) to whichever pipeline needs it; this removes every row where the latitude is not given. You will see this line being used below when we make the interactive maps for the project.
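As a minimal sketch on a hypothetical toy data frame (the values below are made up, mimicking how the real dataset codes a missing coordinate as the sentinel 99999), the two approaches look like this:

```r
library(dplyr)

# Hypothetical toy data: 99999 stands in for a missing latitude.
toy <- data.frame(
  fine     = c(50, 58, 93, 163),
  Latitude = c(99999, 6439998, 99999, 6440041)
)

# Option 1: imputation -- replace the sentinel with the mean of the
# values that were actually recorded.
imputed <- toy %>%
  mutate(Latitude = ifelse(Latitude == 99999,
                           mean(Latitude[Latitude != 99999]),
                           Latitude))

# Option 2: filtering -- drop the rows where the coordinate is missing.
filtered <- toy %>%
  filter(Latitude != 99999)
```

Filtering keeps only the two rows with real coordinates, while imputation fills the two sentinels with the mean of the recorded values.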
Below, you can see the total number of tickets that were issued during the previous week.
## [1] 34619
After loading the data, we want to put it into a form that is friendlier to look at than scrolling through the dataset. First we make a graph of the tickets for the last week and how they are spread across the days of the week. One thing we can see from this graph is that the number of tickets drops drastically on the weekends, presumably because most people do not work on those days and do not need to drive anywhere.
The following graph shows the total number of tickets that were issued during the previous week, by date. It is made by first filtering the data so that only the last week is included, then plotting the data grouped by date. Attaching labels to everything is the last step.
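The filtering-and-plotting step can be sketched like this on a small made-up set of issue dates (the dates and counts are invented for illustration; the real document derives the last week from the most recent `Issue Date` in the data):

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical toy data standing in for parking_citations: one row per
# ticket, with made-up issue dates spread over a single week.
toy <- data.frame(
  `Issue Date` = as_date("2019-05-13") + rep(0:6, times = c(40, 35, 38, 33, 30, 12, 9)),
  check.names = FALSE
)

# Keep only the most recent 7 days of data.
last_week <- toy %>%
  filter(`Issue Date` >= max(`Issue Date`) - days(6))

# Count tickets per date and plot the counts as bars.
p <- last_week %>%
  count(`Issue Date`) %>%
  ggplot(aes(x = `Issue Date`, y = n)) +
  geom_col() +
  labs(x = "Date", y = "Number of tickets",
       title = "Tickets issued during the previous week")
```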
The next graph shows the number of each type of violation on each day of the last week, with a color code distinguishing the violation types within each day. Because there are quite a lot of different violations, there are many colors in each bar, so I added the ability to hover over each part of the bar graph: a tooltip appears telling you which violation it is.
This is accomplished by again grouping the last week of data by violation, arranging it by decreasing number of tickets to make the graph look better, and finally doing the actual plotting with ggplot.
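A hedged sketch of the hover-tooltip idea on invented counts (the violation names and numbers below are made up): ggplotly() wraps the static ggplot so that hovering over a bar segment shows its violation.

```r
library(ggplot2)
library(plotly)

# Hypothetical counts of three violation types over two days.
toy <- data.frame(
  day       = rep(c("Mon", "Tue"), each = 3),
  violation = rep(c("NO PARK/STREET CLEAN", "METER EXP.", "RED ZONE"), 2),
  n         = c(50, 30, 20, 45, 60, 10)
)

# The extra `text` aesthetic carries the label we want in the tooltip.
p <- ggplot(toy, aes(x = day, y = n, fill = violation, text = violation)) +
  geom_col() +
  labs(x = "Day", y = "Tickets", fill = "Violation")

# ggplotly() converts the ggplot into an interactive plotly graph; the
# tooltip is restricted to the violation name.
w <- ggplotly(p, tooltip = "text")
w
```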
## Warning: position_stack requires non-overlapping x intervals
This graph shows the count of tickets per month across 5 years. It will reveal whether any year or month is drastically different, or whether there is a clear trend in tickets per month or per year. The reason the count for month 5 of 2019 is so low is that we are only halfway through the month, so it will be drastically lower than every other month.
This is done by creating new year and month columns in the parking_citations data frame, then filtering to years from 2015 onward to reduce the number of lines on the graph and make it more readable. Next we group by year and month and summarise so that we get one point per month and one line per year. Finally we plot the data and add finishing touches such as a theme, labels for the x and y axes, and a different color for each year's line.
parking_citations %>%
  mutate(Yr = year(`Issue Date`),
         Mth = month(`Issue Date`)) %>%
  filter(Yr >= 2015) %>%
  group_by(Yr, Mth) %>%
  summarise(ct = n()) %>%
  ggplot(aes(Mth, ct, color = factor(Yr))) +
  geom_line() +
  geom_point(size = 1) +
  theme_hc() +
  scale_colour_pander() +  # colour scale, matching the colour aesthetic above
  labs(x = "Month", y = "Count", color = "Year") +
  scale_y_continuous(limits = c(0, 225000)) +
  scale_x_continuous(breaks = seq(1, 12, 1))
Now we will look at the locations of the 200 largest fines on tickets issued last week by creating an interactive map. The map groups data points together to keep it uncluttered; as you zoom in, the large bubbles subdivide into smaller and smaller groups until you reach individual tickets, where you can hover over a dot and see the location, date, time, fine amount, make, color, and violation, separated by bars for convenience.
To make this map, we start by filtering the last week of tickets to keep only the 200 largest fines. Next we convert the latitude and longitude from the feet-based coordinates given in the dataset to ordinary latitude and longitude, storing them in new columns. Finally we plot the data on the map by adding tiles and circle markers, then put a label on each data point with the location, date, time, fine amount, make, color, and violation. Adding a key at the bottom right is the very last step, for convenience when looking over the map.
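A minimal sketch of the projection and mapping step, assuming the dataset's feet-based coordinates are in NAD83 California State Plane Zone 5 (US survey feet); the proj string, sample coordinates, and label below are assumptions for illustration, not the document's exact code:

```r
library(dplyr)
library(proj4)
library(leaflet)

# Assumed projection for the feet-based X/Y values (California State
# Plane Zone 5, NAD83); false easting/northing are given in meters.
pj <- paste("+proj=lcc +lat_1=34.0333333 +lat_2=35.4666667 +lat_0=33.5",
            "+lon_0=-118 +x_0=2000000 +y_0=500000 +datum=NAD83 +units=us-ft")

# Hypothetical top-fine rows; in the real document these come from
# filtering parking_citations down to the 200 largest fines.
top_fines <- data.frame(
  Latitude  = c(6439998, 6440041),   # planar X in feet
  Longitude = c(1802686, 1802686),   # planar Y in feet
  fine      = c(363, 255)
)

# project(..., inverse = TRUE) maps planar X/Y back to long/lat degrees.
ll <- project(list(top_fines$Latitude, top_fines$Longitude),
              proj = pj, inverse = TRUE)
top_fines <- top_fines %>% mutate(lon = ll$x, lat = ll$y)

# Clustered circle markers subdivide as you zoom, like the map above.
m <- leaflet(top_fines) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   clusterOptions = markerClusterOptions(),
                   label = ~paste("Fine:", fine))
m
```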
From this map we can clearly see some clustering of ticket prices, but can we determine whether certain factors influence ticket price to some degree of statistical significance? This is where we use hypothesis testing. We will use a linear regression model, as we want to predict ticket prices based on factors such as location.
Our final linear regression model will look something like this:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}x_1 x_2\]
Our null hypothesis is \(H_0: \beta_1 = \beta_2 = \beta_{12} = 0\) and our alternative is that not all of these values are 0. This means that some linear relationship exists.
Now we construct our model before analyzing it, to see whether we can reject our null hypothesis and whether we have a good model. We also include an interaction between latitude and longitude because these two variables are related.
lmfit <- lm(`Fine amount`~Latitude*Longitude, parking_citations)
tidy(lmfit)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.04e+ 1 2.76e- 2 2551. 0
## 2 Latitude -5.84e- 9 1.65e- 8 -0.354 0.723
## 3 Longitude -1.20e- 7 5.57e- 8 -2.15 0.0319
## 4 Latitude:Longitude 3.14e-17 1.01e-17 3.11 0.00190
Note the p-values produced for lmfit in the last column. The Latitude term alone is not significant (p ≈ 0.72), but the Longitude term (p ≈ 0.03) and the Latitude:Longitude interaction (p ≈ 0.002) have small p-values, meaning there is little chance we would see estimates this far from zero by random chance if the null were true. On this basis we can reject the null hypothesis that no relationship exists between location and fine amount.
We can further test the hypothesis by computing an F statistic to describe how close each of the \(\beta\)s are to 0. If the statistic is significantly greater than 1, we can reject our null hypothesis.
lmfit %>%
glance() %>%
select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
## r.squared sigma statistic df p.value
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 0.00000838 32.1 26.3 4 5.54e-17
Although we have rejected our null hypothesis, our r squared value shows that the model captures almost none of the variation in fine amount.
This is most likely because a linear model is not the best choice for predicting fine amount from latitude and longitude, and we would need a different model to make accurate predictions. We can instead try to explain something else with a linear relationship, such as the make of the car. You might wonder how a qualitative variable can be used in a regression: behind the scenes, R creates a dummy variable for each make, which takes the value 0 or 1 depending on whether the car is of that make. In our data the sample contains 55 different makes, so the regression has 54 dummy variables plus the intercept.
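A tiny base-R illustration of this dummy-variable expansion (the makes below are made up): model.matrix shows the design matrix R builds behind the scenes, with one level absorbed into the baseline.

```r
# Hypothetical toy data with a qualitative Make column.
toy <- data.frame(
  fine = c(50, 58, 93, 163),
  Make = factor(c("HOND", "TOYT", "HOND", "FORD"))
)

# model.matrix() expands the factor into 0/1 dummy columns; the first
# level alphabetically (FORD) becomes the baseline, so with 3 makes we
# get 2 dummy columns plus the intercept.
model.matrix(fine ~ Make, toy)
```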
In addition, because the dataset is so big, it’s helpful to reduce the size while we are still figuring out our model so that we can run it on any computer. Here we use a random selection of 1000 data points as a sample.
sample_df <- parking_citations %>%
sample_n(1000)
lmfit2 <- lm(`Fine amount`~Make, sample_df)
tidy(lmfit2)
## # A tibble: 55 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 70.2 9.90 7.10 2.53e-12
## 2 MakeAUDI -9.85 12.0 -0.818 4.14e- 1
## 3 MakeBMW -3.63 10.9 -0.332 7.40e- 1
## 4 MakeBUIC 16.1 22.1 0.727 4.68e- 1
## 5 MakeCADI -6.08 17.1 -0.355 7.23e- 1
## 6 MakeCCH -20.2 35.7 -0.567 5.71e- 1
## 7 MakeCHEC -2.25 35.7 -0.0630 9.50e- 1
## 8 MakeCHEV -0.171 10.8 -0.0158 9.87e- 1
## 9 MakeCHRY 1.28 12.9 0.0989 9.21e- 1
## 10 MakeDODG 5.25 11.7 0.448 6.54e- 1
## # … with 45 more rows
Unlike before, most of the individual p-values here are large, so on this 1,000-row sample few makes show an individually significant effect on fine amount.
If we look at our r squared value, however, we see that it is much higher than in our previous model, even though it is still far from perfect. This means the make explains more about the ticket price than location did, but there is still a lot in the data that we have not explained, and the overall F test (p ≈ 0.95) is no longer significant on this small sample.
lmfit2 %>%
glance() %>%
select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
## r.squared sigma statistic df p.value
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 0.0390 34.3 0.708 55 0.945
We can also look at a plot of fitted values versus residuals, which shows no strong apparent pattern apart from a couple of outliers.
augment(lmfit2) %>%
ggplot(aes(x=.fitted,y=.resid)) +
geom_point() +
geom_smooth() +
labs(x="fitted", y="residual")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In the end, we learned how to import and view data in R, tried to find some interesting details about the data, and then tried to fit a model to those observations to see if we could accurately predict ticket prices. Although we did not end up finding a perfect linear model, oftentimes we never will. What we have done, however, is establish how to determine what makes a model good, and we tried both location and make to predict ticket prices. We established that although some sort of relationship exists, it cannot be entirely explained with just a linear model. In the future, we can try other types of regression and machine learning models to find one with a better fit.